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Abstract: Given a matrix A of size m x n, the manuscript describes a algorithm 
for computing a QR factorization AP = QR where P is a permutation matrix, Q is 
orthonormal, and R is upper triangular. The algorithm is blocked, to allow it to be 
implemented efficiently. The need for single vector pivoting in classical algorithms 
for computing QR factorizations is avoided by the use of randomized sampling to 
find blocks of pivot vectors at once. The advantage of blocking becomes particularly 
pronounced when A is very large, and possibly stored out-of-core, or on a distributed 
memory machine. The manuscript also describes a generalization of the QR factor¬ 
ization where we allow P to be a general orthonormal matrix. In this setting, one 
can at moderate cost compute a rank-revealing factorization where the mass of R is 
concentrated to the diagonal entries. Moreover, the diagonal entries of R closely ap¬ 
proximate the singular values of A. The algorithms described have asymptotic flop 
count 0{mn min(m, n)), just like classical deterministic methods. The scaling con¬ 
stant is slightly higher than those of classical techniques, but this is more than made 
up for by reduced communication and the ability to block the computation. 


1. Introduction 

Given an m x n matrix A, with m > n, the classical QR factorization takes the form 

A = Q R P* 

m x n m x i i x n n x n 

where Q is orthonormal, R is upper triangular, and P is a permutation matrix. The inner dimension i can 
be either i = n for an “economy size” factorization, or £ = m for a “full factorization.” A standard way 
of computing the QR factorization is to successively drive A towards upper triangular form by applying 
a sequence of Householder reflectors from the left (encoded in Q). To ensure that the diagonal entries of 
R decay in magnitude, it is common to use column pivoting, which can be viewed as applying a sequence 
of permutation matrices representing column swaps to A from the right (encoded in P). 

In this manuscript, we use randomized sampling to improve upon the performance of the classical House¬ 
holder QR factorization with column pivoting in two ways: 

(1) We will enable blocking of the algorithm. The resulting algorithm interacts with A via a sequence 
of BLAS3 operations. 

(2) We will describe a variation of the QR factorization for which the diagonal entries of R tend to 
be very close approximations to the singular values of A. Such a factorization is commonly called 
a Rank-Revealing QR factorization. 

The computational gains from eliminating column pivoting will be the most pronounced for very large 
matrices, in particular ones stored on distributed memory systems, or out-of-core. Likewise, the RRQR we 
present is particularly competitive for matrices large enough that computing a full SVD is not economical. 
Observe that the methods presented will have higher flop counts than classical methods. The benefit is 
that they need less data movement. 
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2. Notation 

We follow [4] for general matrix notation. Given a matrix A, we let A* denote its transpose when A is 
real, and its adjoint when A is complex. We let 1^ denote the k x k identity matrix. 

The algorithms presented can advantageously be implemented using standard libraries for computing 
matrix factorizations, or matrix-matrix multiplications. This simplifies coding since it makes the codes 
portable, and allows us to fully benefit from highly optimized routines for standard computations. When 
estimating computational costs, we use the following simplified model: We assume that the cost of 
multiplying two matrices of sizes l x m and m x n is 

Cmm Imn. 

The cost of performing a full SVD or QR factorization of a matrix of size m x n, with m > n, is 

C qT mn 2 , and C sv dmn 2 . 

We will sometimes use non-pivoted QR factorizations. We assume that the cost of this is 

C™ piv mn 2 . 

Typically, Cq° piv is smaller that C qv . 

Remark 1. Observe that when computing a pivoted QR factorization of a matrix A of size m x b 
where m > b, it is always possible to first do a non-pivoted factorization, and then do a small pivoted 
factorization on a matrix of size 6x6. To be precise, we first factorize 

A = Q' R' 

m x 6 mx 6 6x6 

with no pivoting. This is perfectly stable since Q r is built as a product of ON transforms. Then perform 
a pivoted QR factorization of the small square matrix R 7 , 

R' P = Q" R. 

6x6 6x6 6x6 6x6 

Finally, simply set Q = to obtain the factorization (1). 

3. Review of QR factorization using Householder reflectors 

The algorithm presented in this manuscript is an evolution of the classical technique for computing a QR 
factorization via column pivoting and a sequence of Householder reflectors, see, e.g., [4, Sec. 5.2], In this 
section, we briefly review this technique and introduce some notation. Throughout the section, A is a 
real matrix of size m x n, with m >n. The generalization to complex matrices is trivial. 

3.1. Householder reflectors. Given a vector a £ M fc , the associated Householder reflector H = H(a) is 
the k x k unitary matrix defined by 

H = l — 2vv*, where v = t/— 11 —^tt, and where (3 = — sign(a(l)) llall. 

11/5 e i a || 

The Householder reflector maps a to a vector whose entire mass is concentrated to its first entry: 

± ||a|| 

0 

Ha = 0 


0 
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Figure 1. Example of how a 4 x 4 matrix A is driven to upper triangular form in a 
classical QR factorization. Each figure shows the sparsity pattern of the matrix A y , with 
notation as in Section 3. 


3.2. QR factorization without pivoting. In this section, we describe how to drive the given matrix 
A to upper triangular form by applying a sequence n — 1 Householder reflectors from the left. We set 
Ao = A, and let A j denote the result of the first j steps of the process. The sparsity patterns of these 
matrices are shown in Figure 1. In practice, each A j simply overwrites Aj_i. 

Step 1: Let H denote the Householder reflector associated with the first column of Ao (shown in red in 
Figure 1). Set Q (1 ^ = H. Then applying Q (1) = Q (1 )* to Ao will have the effect of “zeroing out” out 
elements below the diagonal in the first column. We set 


( 2 ) 


Ao = Q«*A 0 = 


m 

o 

o 

o 


r\2 

a 22 

a 32 

a 42 


H3 

a 23 

a 33 

a 43 


r\A 

a 24 

a 34 

a 44 


where rn = —sign(Ao(l, 1)) ||Ao(:, 1)||. (Note: In equation (2), the matrix is symmetric, but we 
include the transpose to keep notation consistent with later formulas involving non-symmetric matrices.) 


Step 2: Let H be the Householder reflector of size (n — 1) x (n 
(shown in blue in Figure 1). Set 


Q( 2 ) 


Ii 0 
0 H 


1) associated with the vector Ai(2 : n, 2) 


The effect of applying Q (2 )* = Q* 2 ) to Ai will then be to “zero out” all sub-diagonal elements in the 
second column. We define 


( 3 ) 


A 2 = Q( 2 )*Ai = 


Hi ri 2 r 13 
0 r 22 r 23 

0 
0 


0 

0 


a' 


33 


a' 


43 


r 14 

r 2 4 

n" 
“ 34 

n" 

“44 


where r 22 = —sign(Ai(2,2)) ||A 2 (2 : n, 2)||. 


Step j = 3, 4, ..., n — 1: Continue just as in Step 2, zeroing out all elements of Aj_i below the diagonal 
in the j’th column, using the Householder reflector associated with the vector Aj_i(j : m,j). 


Once all steps have been completed, observe that A n _i is upper triangular, and that we have now con¬ 
structed a factorization 


i 



qO-1)* q(™-2)* . . . q(2)* q(1)* fa 

^ ^ ^ 

= :Q* 


= :R 
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• Set Q = I and P = I. 

• for j = 1, 2, 3,..., n — 1 

— Let j' denote the index of the largest column in A (j : m,j : n ), and let P 
denote the permutation matrix that swaps columns j and f. 

— Let H denote the Householder reflector associated with the vector A (j : m,f), 

and set Q = 1 ^ 

— Update A, Q, and P via 

A <- QAP, 

Q - QQ, 

P <- PP. 

end for 


Figure 2. The algorithm QR. Given an input matrix A of size m x n, with m > n, the 
algorithm produces a factorization AP = QR with R upper triangular, Q orthonormal, 
and P a permutation, cf. (1). The matrix R overwrites A. 

3.3. Column pivoting. It is often desirable that the diagonal entries of R should form a decreasing 
sequence 

(4) |R(1,1)1 > |R(2, 2)1 > |R(3,3)| > ••• . 

This can be obtained by introducing pivoting into the scheme described in Section 3.2. The only modifi¬ 
cation required is that we now need to also hit A with ON-transforms from the right. 

To be precise, at the start of step j, let j' denote the index of the largest column of A ? _i, among the 
“remaining” columns j, j +1, ..., n. Then let P^ denote the permutation matrix that swaps the columns 
j and f. Then in the matrix Aj_iP^, the j column will have the largest column norm among the columns 
in Aj„i(j : m,j : n). Determine the Householder reflector so that it “zeros out” the sub-diagonal 
entries in the j’th column of Aj_iP^\ and then set 

A, := QW’A^i PW). 

Once all n — 1 steps have been completed, we end up with a factorization 

(5) A„_1 = Q( n_1 )* qO-2)* . . . q( 2 )* q! 1 )* A p(!) p(2) . . . p(n—2) p(n-l) _ 

-V—^ V s/ • S Ss -v-^ 

= :R =:Q* = :P 

Left multiplying (5) by Q = Q (1) Q (2 ^ • • • Q^ _1 l yields the factorization (1). 

Figure 2 summarizes the classical Householder QR algorithm. 

4. Block pivoting using randomized sampling 

The QR factorization algorithm based on Householder reflectors described in Section 3 is exceptionally 
stable and accurate, but has a shortcoming in that it is hard to block. The traditional column pivoting 
strategy described necessarily must proceed one vector at a time, since you cannot find the j pivot column 
until after the (j — l)’th Householder transform has already been applied. In this section, we address 
the task of how to find batches of b pivot vectors at once. Conceptually, we seek to determine a set of b 
vectors whose spanning volume is maximal, among the remaining columns. 

Suppose we are given a matrix A of size m x n, and let b be a block size. We will describe two related 
techniques for computing an ON matrix S of size n X n such that the first b columns of AS form good 
choices for the first b “pivot columns” in a QR factorization. In Section 4.1, we describe a technique for 
constructing a matrix S that is a permutation matrix, so that the first k columns of AS are simply k 
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chosen columns from A. In Section 4.2 we generalize slightly further from the classical QR factorization, 
and will describe a “pivoting matrix” S that is ON, but is not merely a permutation (in fact, it will consist 
of a sequence of Householder reflectors). In other words, each of the first k columns of AS will consist of 
linear combinations of columns of A. 


4.1. Finding a permutation matrix S. Draw a Gaussian random matrix Q. of size mxb, and compute 
a “sampling matrix” Y via 


( 6 ) 


y = a* n. 

n x b n x m mxb 


Then one can prove that with very high probability, the linear dependencies among the columns of Y 
closely match the linear dependencies among the columns of A. This means that if we perform a classical 
QR factorization 


( 7 ) 


Y* S = Q R, 

b x n n x n b x b b x n 


then the permutation matrix S chosen will also be a good permutation matrix for A. The cost of finding 
this matrix S is 

Cm m rrinb + C qi nb 2 . 


4.2. Finding a general orthonormal matrix S. Construct a Gaussian random matrix Q of size mxb, 
and again compute a “sampling matrix” Y = A*fi, cf. (6). Next, perform a QR factorization of Y to 
form the factorization 


( 8 ) 


Y P = S R. 

n x b b x b n x n n x b 


Comparing (8) to (7) we observe that (8) represents a orthonormalization of the columns of Y, while (7) 
orthonormalizes the rows of Y. Now observe that 


In other words, S* is on orthonormal map that rotates all the mass in Y into the first k rows. This means 
that in the matrix AS, the leading b columns approximately span the same space as the leading b left 
singular vectors of A, which would form the “ideal” pivoting vectors. 


Remark 2. One can prove that in computing the factorization (8), it is possible to forgo pivoting entirely 
(so that P = I), which will accelerate this computation. 

Remark 3. The matrix S described in this subsection is a large (of size nxn ) ON matrix. Note, however, 
that it is composed simply of a product of b Householder reflectors. This means that the factorization (8) 
can be computed in 0(nb 2 ) operations, S requires 0(nb ) storage, and can be applied to a vector using 
0(nb ) flops. 


4.3. Over-sampling. The accuracy of the procedures described in this section can be improved by 
constructing a few “extra samples.” Let r denote an over-sampling parameter. The choice r = 10 is often 
excellent, and r = b leads to very high accuracy. Then generate a Gaussian matrix £2 of size m x (b + r). 
Then, in computing the factorizations (7) or (8), execute only the first b steps of the QR process. 


5. Blocked QR 

We describe a blocked version of the basic Householder QR algorithm (cf. Section 3) in Section 5.1. The 
block pivoting can be done using either permutation matrices (cf. Section 4.1) or Householder reflectors 
(cf. Section 4.2). The two resulting algorithms are analyzing in Sections 5.2 and 5.3, respectively. 
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• Set Q = I and P = I. 

• for i = 1,2,3,... ,p — 1 

— Partition the index vector / = [I\ I 2 I 3 ] so that I 2 = b{i — 1) + (1 : b) is the 
block processed in step i. Then partition the matrix accordingly, 



' An 

A12 

A13 

A = 

0 

A22 

A23 


0 

A32 

A33 


(Observe that in the first step, I\ = [] and A has only 2x2 blocks.) 

— Determine a pivoting matrix P by processing A ([/2 P 3 ], [I2 ^ 3 ]), as described 
in Section 4. Then update the last two block columns of A accordingly, 


Al 2 

_> 

CO 

_ 1 


1 

> 

to 

> 

CO 

_ 1 

A22 

> 

to ^ 

CO 

= 

A22 

A23 

1 

> 

CO " 

to 

A33 


1 

> 

CO 

to 

> 

CO 

CO 

1 _ 


— Execute a QR-factorization 


A 22 
A 23 
m! x b 


P 

bxb 


Q 

m! x m! 


R22 

0 J ’ 

m! x b 


where m! = m — (i — 1)6 is the length of the index vector [I 2 I 3 ]. Observe that 

while Q is large, it consists simply of a product of b Householder reflectors. 

// 

23 


— Compute the new blocks A 23 and A 33 via 

— Update the matrices A, Q, and P, via 


A" 


A" 

rt 33 


Q 


A' 


23 

^33 


Q • Q 


An 

0 

0 

I 


m 

0 


'n 1 

0 


Ai 2 P 

R22 

0 

0 
Q 


0 

P 


A13 

A'4 

A" 3 


1 

Ini 

0 

0 ' 


0 

p 

0 


0 

0 

1^3 . 


end for 

At this point, all that remains is to process the lower right bxb block. Partition 

An A12 

0 A22 

so that A 22 is of size bxb. Observe that An is already upper triangular. Now 
compute a QR factorization QR22 = A22P- Then simply update 


An 

0 


A12 

R22 


Q = Q 


n 1 

0 


0 

Q 


n 1 

0 


0 

p 


Figure 3. The algorithm blockQR. Given an input matrix A of size m x n, with m > n, 
the algorithm produces a factorization AP = QR with R upper triangular, and P and Q 
orthonormal, cf. (1). The matrix R overwrites A. 


5.1. The algorithm. Suppose that A is an m x n matrix, with rn > n. We will drive A to upper 
triangular form by processing blocks of b vectors at a time. Suppose for simplicity that n is a multiple of 
b, so that n = bp for some integer p. The blocked algorithm resulting is shown in detail in Figure 3. The 
non-zero elements in the matrix A y obtained after j steps of the blocked algorithm is shown in Figure 4. 
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Figure 4. Example of how a 12 x 12 matrix A is driven to upper triangular form in the 
blocked randomized QR factorization. Each figure shows the sparsity pattern of matrix 
A j, with notation as in Section 5. 


For efficiency, it is essential that all manipulations involving ON matrices be executed using the fact 
that these are products of Householder reflectors. Observe that if an n x n matrix U is a product of 6 
Householder reflectors, then U admits the representation 

U = I + W Y 

n x n n x n n X b b x n 

for some matrices W and Y, see [1]. This should be exploited whenever a matrix such as U is applied to 

A. 

5.2. Block pivoting via permutation matrices. Suppose that we build the pivoting matrix P in 
blockQR as a permutation matrix, using the strategy of Section 4.1. Observe that this step still requires 
“single-vector” column pivoting. The gain here is that the pivoted factorization is executed on a small 
matrices of size b x n. A secondary pivoted factorization is also executed to build the matrices Q and P, 
but again, the matrix involved is of size (m — (i — 1)6) x 6 . 

The key point is that the interaction with A is done exclusively via matrix-matrix multiplications. This 
leads to a modest acceleration when A fits in RAM but is large. When A is stored on a distributed 
memory machine, or is stored out-of-core, we expect a decisive speed-up. 

5.3. Block pivoting via Householder reflectors. The asymptotic cost of the algorithm using House¬ 
holder reflectors instead of permutation matrices has the same scaling as the algorithm based on per¬ 
mutation matrices, although the constants are slightly larger. The benefit of this algorithm is that the 
pivoting is “better” in the sense that more of the mass is concentrated to the diagonal blocks, as we will 
discuss in detail in Section 6. 


6 . Rank-Revealing QR factorization 

The idea of Rank-Revealing QR factorization (RRQR) [5, 2] is to combine some of the advantages of 
QR factorizations and singular value decompositions. While the SVD is excellent for revealing how well 
a given matrix can be approximated by a matrix of low rank, it can in principle only be computed 
via iterative procedures. In practice, the best iterative schemes converge fast enough that they in most 
circumstances behave just like deterministic algorithms. The idea of an RRQR is that it can be computed 
using non-iterative methods (typically substantially faster than a full SVD) and reveals rank almost as 
well as an SVD. 


6.1. Low rank approximation. Let A be of size mxn , with m > n. The Singular Value Decomposition 
(SVD) of A takes the form 


= U D V*, 


( 9 ) 


A 

mxn 


mxn n x n n x n 




























































where U and V are orthonormal. The diagonal matrix D has as its diagonal entries the singular values 
{o‘j}" =1 , ordered so that o\ > a -2 > ■ ■ ■ > cr n > 0. The Eckart-Young theorem [3] states that when matrices 
are measured in either the spectral or the Frobenius norm, then the truncated SVD is an optimal low-rank 
approximation to a matrix. To be precise, fix a rank k , and partition 

<“> A = Y [ D o u d ° 2 ][$,[ 

so that Du is of size k x k. Then in the spectral norm, we have 

min{||A — B|j : B has rank k} = ||A — UiDnVi|| = HU 2 D 22 V 2 II = 11D 22 11 = &k+ 1 - 


The statement for the Frobenius norm is analogous: 


min{||A — B||fi-o : B has rank k} = ||A — UiDuViUfto = ||U 2 D 22 V^ 


2 11 Pro = || L'22 11 Pro = 


£■ 

j=k+1 


Now suppose that we do the same thing for a QR factorization 


AP = [Qi Q 2 


Rll R 12 

0 R 2 2 


Then the error in a rank-A; approximation would be 

I! AP — Qi [R 11 R 12 ] || = ||Q 2 R 22 1| = ||R 22 1| • 

In order for the error in the truncated QR to be close to optimal, we would like to have that || R 22 1| ~ || D 22 1| • 
Enforcing this for every choice of A:, we find that we seek 

(11) || R((fc + 1) : n, (k + 1) : n)|| ps ||D((A; + 1) : n, (k + 1) : n)||, k = 1, 2, ...,n - 1. 

A closely related condition is the slightly stronger, and simpler, condition that the diagonal entries of R 
should all approximate the corresponding singular values, so that 


|R(A;, A:)| « a k , 


k = 1, 2, ..., n. 


Informally, the idea is to move as much mass as possible in the matrix R onto the diagonal entries. 

In traditional QR factorizations, one typically assumes that the matrix P is a permutation matrix, so 
that the columns of Q(:, 1 : k ) provide an orthonormal basis for a selection of k columns of A. Under 
this condition, it is possible to construct counter-examples that demonstrate that the best possible QR 
factorization will fail to achieve either (11) or (12). However, since we allow P to be a general orthonormal 
matrix, there is nothing in principle that prevents us from realizing these bounds to very high precision. 
We will in Sections 6.2 and 6.3 describe three modifications to the QR factorization scheme in Section 5 
that will make the scheme output a very high quality QR factorization satisfying (11) and (12). The final 
scheme uses the following building blocks (where b is the block size): 

• The matrix A will be interacted with only via matrix-matrix multiplies involving matrices with 
at most 2 b columns or rows, and low-rank updates. 

• We will use unpivoted QR factorizations of matrices of size at most rn x 2b or n x 2b. 

• We will compute full SVDs of n/b matrices of size at most 2b x 2b. 


6.2. Diagonalizing the diagonal blocks. In the blocked QR algorithm in Figure 3, we can at very low 
cost enforce that the diagonal block R 22 be not only upper triangular, but diagonal. All that is required 
is to replace the local QR factorization by a (full) SVD 

A 22 _ || °22 .7* 

. A 32 J i 0 J ’ 

m! x b m! x m! m! x b b x b 


( 13 ) 
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and then use U instead of Q, and V instead of P. We compute the factorization (13) via two steps: First, 
perform an unpivoted QR factorization 


A' 1 

m 22 

A' 

L M 32 J 

m ! x 6 


Q 

m ! x m ! 


R22 

0 

m ! x 6 


Observe that Q is a product of b 


Finally, compute U via 


The total cost of this step is 


1 Householder reflectors. Then compute the SVD of R 22 : 

R22 = U' d 22 V*. 

6x6 6x6 6x6 6x6 


U = Q 


U' 0 

0 I 


n 3 


C°°P 1V m b z + C svd b A + C mm m! I/. 


6.3. Power iteration to improve pivoting further. In Section 4.2, we describe a technique for finding 
an ON matrix S such that the first 6 columns of AS form suitable “pivots” in a QR factorization. A 
theoretically excellent choice for such a matrix S would be an ON matrix whose first 6 columns span the 
leading 6 right singular vectors of A. To see why, suppose that we partition S so that 

S = [Sx S 2 ] 

in such a way that Si is of size n x 6, and 

(14) VIS 2 = 0, and V^Si = 0. 

With the SVD of A partitioned as in (10), we then find that 

AS= [UiDnVJSi U 2 D 22 V£Si]. 

Then the pivot columns will span precisely the leading 6 left singular vectors, which means that after we 
apply the first 6 Householder reflectors from the left, the resulting matrix will be block diagonal. 

Now, finding a matrix S for which (14) holds is hard, since it amounts to finding the span of the leading 6 
right singular vectors of A. (The purpose of computing an RRQR is precisely to avoid this!) But finding 
an approximate span of the 6 leading singular vectors is something that randomized sampling excels at. 
The matrix Y described in Section 4.2 is built specifically so that its columns span the space we seek 
to determine. The alignment between the range of Y and the range of Vi can be further improved by 
applying a power of the remaining columns. To be precise, let X = A([I 2 ,/ 3 ], [12,1s]) denote the block of 
A that remains to be driven to upper triangular form. Then if we fix an integer q, and build a sample 
matrix 

y = (x*x) x*a 

where £2 is again a matrix with i.i.d. Gaussian entries, then as q increases, the range of Y tends to rapidly 
converge to the range of Vi (e.g. if all singular values are distinct, then such convergence can easily be 
proven). In practice, choosing q = 1 or q = 2 tends to give excellent results. 

Finally, to attain truly high accuracy, we will employ over-sampling as described in Section 4.3, but more 
aggressively than before. While in the standard QR factorization, it is fine to choose the over-sampling 
parameter p to be a fixed small integer (say p = 5 or p = 10, or even p = 0), we have empirically found 
that with p = 6, we attain an excellent alignment between the spans of Vi and Si. 


6.4. The algorithm blockRRQR. Our method for computing an RRQR is a obtained by starting with 
the blocked QR algorithm blockQR (cf. Figure 3), and then modifying it by diagonalizing the diagonal 
blocks (as described in Section 6.2), and then applying the high-accuracy pivoting scheme described in 
Section 6.3. The resulting algorithm blockRRQR is summarized in Figure 5. The sparsity pattern of the 
matrix after each step of the algorithm is shown in Figure 6. 
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Set Q = I and P = I. 
for i = 1,2,3,... ,p — 1 

— Partition the index vector / = [I\ I 2 I 3 ] so that I 2 = b{i — 1) + (1 
block processed in step i. Then partition the matrix accordingly, 


b) is the 



' An 

A12 

A13 

A = 

0 

A22 

A23 


0 

A32 

A .33 


(Observe that in the first step, I\ = [] and A has only 2x2 blocks.) 

Set X = A ([/2 I 3 ], [I 2 h])- Then determine a pivoting matrix P by process¬ 
ing (X*X) g X*, as described in Section 6.3. Then update the last two block 
columns of A accordingly, 


P. 


A) 2 

_> 

CO 

_1 


1 

> 

to 

1- 

CO 

< 

A 22 

> 

to ■" 

CO 

= 

A 22 

A 23 

1 

> 

CO ■" 

to 

A 33 


1 

> 

CO 

to 

> 

CO 

CO 

1_ 


— Execute a full SVD 


A22 


A23 
m! x b 


U 


m ! x m ! 


D 22 

0 

m' x b 


V , 

b x m' 


where m' = m — (i — 1)6 is the length of the index vector [I 2 I 3 ]. Observe 
that while U is large, it has internal structure that allows it to be applied 
efficiently, cf. Section 6.2. 


Compute the new blocks A 23 and A' 

Update the matrices A, Q, and P 

An 
0 


n 

33 via 


A" 

rt 23 

A" 

m 23 


u 


A 23 

A2.3 


A 12 v 


via 

A) 


Q • Q 


m 

0 


*ni 

0 


D 22 

0 

0 

0 

0 

p 


13 

A" 

rt 23 

A" 

rt 33 


1 

Ini 

0 

0 ' 


0 

V 

0 


0 

0 

In3 


end for 

At this point, all that remains is to process the lower right 6 x 6 block. Partition 

An A12 

0 A22 

so that A 22 is of size 6 x 6 . Observe that An is already upper triangular, 
compute the (full, but small) SVD A 22 = UD 22 V . Then update 


Now 


An 

0 


A12 

D22 


Q = Q 


ni 

0 


0 

u 


n 1 

0 


0 

V 


Figure 5. The algorithm blockRRQR. Given an input matrix A of size m x n, with m > 
n, and a block size 6 , the algorithm produces a factorization AP = QR with R upper 
triangular, and P and Q orthonormal, cf. (1). For simplicity, we assume that n = bp for 
some integer p. The matrix R overwrites A. 
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Figure 6. Example of how a 12 x 12 matrix A is driven to upper triangular form in the 
randomized RRQR factorization. Each figure shows the sparsity pattern of matrix A 
resulting after j steps of the algorithm blockRRQR, cf. Figure 5. 
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7. Numerical experiments 

In this section, we test three different methods for computing a full QR factorization of a given matrix A: 

Method 1 The Algorithm blockQR as shown in Figure 3, with the “pivoting matrix” chosen as a permutation 
matrix as described in Section 4.1. No over-sampling is used. 

Method 2 The Algorithm blockQR as shown in Figure 3, with the “pivoting matrix” chosen as a product of 
Householder reflectors as described in Section 4.2. No over-sampling is used. 

Method 3 The Algorithm blockRRQR as shown in Figure 5, with the over-sampling parameter p set to be 
half the block size b. 

We did not include over-sampling when running blockQR since numerical experiments indicated that there 
was essentially no benefit to doing so. For simplicity, A is in every experiment a real matrix of size n x n. 

At the time of writing, we have not yet implemented an optimized version of the algorithm, so we cannot 
present timing comparisons. The purpose of the numerical experiments is to demonstrate the very high 
accuracy of the proposed techniques. 

7.1. A matrix with rapidly decaying singular values. In our first experiments, we apply the various 
factorization algorithms to the matrix 

A fast = UDV*, 

with U and V unitary matrices drawn from a uniform distribution. The matrix D is diagonal, with 
diagonal entries 

D(i,i) = (io -5 ) ( *~ 1)/(n-1) . 

In other words, the singular values of A fast decay exponentially, from a\ = 1 to a n = 10~ 5 . In the 
experiments shown, we set n = 300, and use the block size b = 50. 

We first study “Method 1” (blockQR with a pivoting matrix). Figure 7 shows the error e k obtained when 
truncating the factorization, 

e k = ||A fast - A[ ast ||, where A^ 8 * = Q(:, 1 : k) R(1 : k ,:) P*. 

The figure also shows the corresponding errors when A^ ast is the truncated SVD and the truncated QR 
factorization obtained from classical column pivoting. We make three observations: 

• The errors resulting from Method 1 are very close to the errors obtained by the column pivoted 
QR factorization. 

• The errors from every truncated QR factorization involving a permutation matrix that we tried 
are substantially sub-optimal, as compared to the truncated SVD. 

• Using the “power method” described in Section 6.3 leads to almost no improvement in this case. 

Figure 8 shows how the diagonal entries of {|R(fc, fe)|}fc=i compare to the singular values {0{k, k)}^ =1 . 
The figure shows that Method 1 performs no better (and no worse) than classical column pivoted QR. 

We next use “Method 2” (blockQR with a Householder pivoting matrix) to compute the factorization (1), 
with the results shown in Figures 9 and 10. The approximation error is now much better even without 
using the power method, and once the power method is employed, results improve very rapidly. 

Finally, we test “Method 3” (blockRRQR with over-sampling parameter p = 25) to compute the factoriza¬ 
tion (1), with the results shown in Figures 9 and 10. The approximation error is now much better even 
without using the power method, and once the power method is employed, results improve very rapidly. 

In all experiments shown, we see that the randomized methods are particularly good at minimizing errors 
in the Frobenius norm. 
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Figure 7. Approximation errors from Method 1 (blockQR with a permutation matrix) 
for A = A tast , cf. Section 7.1. Also included are the errors resulting from a truncated SVD 
(which are theoretically optimal), and a column pivoted QR factorization. 



Figure 8. Comparison of |R(fc,A;)| for the matrix R resulting from Method 1, and the 
singular values of A fast . We would ideally like these values to be close. 
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Figure 9. Approximation errors from Method 2 (blockQR with a Householder pivoting 
matrix) for A = A fast , cf. Section 7.1. Also included are the errors resulting from a 
truncated SVD (which are theoretically optimal), and a column pivoted QR factorization. 



Figure 10. Comparison of |R(/c,fc)| for the matrix R resulting from Method 2, and the 
singular values of A fast . 
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Figure 11. Approximation errors from Method 3 (blockRRQR with p = 25) for A = A fast , 
cf. Section 7.1. Also included are the errors resulting from a truncated SVD (which are 
theoretically optimal), and a column pivoted QR factorization. 



Figure 12. Comparison of |R(fc, fc)| for the matrix R resulting from Method 3, and the 
singular values of A fast . 
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7.2. A random Gaussian matrix. We next repeat all experiments conducted in Section 7.1, but now 
for a square matrix A gauss with i.i.d. normalized Gaussian entries. The singular values of this matrix 
initially decay slowly, but then plummet at the end. The matrix is again of size 300 x 300, and we used 
a block size of b = 50. The results are shown in Figures 13 - 18. 

We see that the results for A guass , whose singular values decay slowly, are quite similar to those for 
A fast , whose singular values decay rapidly. The performance of Method 1 is again very similar to that of 
classical column pivoted QR, with very little benefit seen from using the power method. Once we allow 
the permutation matrix to be Householder reflectors, the errors improve greatly, in particular once the 
power method is employed. For this example, it is worth noting that for Method 3 with the power method 
engaged with q = 2, the diagonal entries of R are very close to the true singular values, even at the very 
end of the spectrum. 



17 




Figure 13. Approximation errors from Method 1 (blockQR with a permutation matrix) 
for A = A gauss , cf. Section 7.2. Also included are the errors resulting from a truncated 
SVD (which are theoretically optimal), and a column pivoted QR factorization. 



Figure 14. Comparison of |R(/c,fc)| for the matrix R resulting from Method 1, and the 
singular values of A gauss . We would ideally like these values to be close. 
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Figure 15. Approximation errors from Method 2 (blockQR with a Householder pivoting 
matrix) for A = A gauss , cf. Section 7.2. Also included are the errors resulting from a 
truncated SVD (which are theoretically optimal), and a column pivoted QR factorization. 



Figure 16. Comparison of |R(/c,fc)| for the matrix R resulting from Method 2, and the 
singular values of A gauss . 
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Figure 17. Approximation errors from Method 3 (blockRRQR withp = 25) for A = A gauss , 
cf. Section 7.2. Also included are the errors resulting from a truncated SVD (which are 
theoretically optimal), and a column pivoted QR factorization. 



Figure 18. Comparison of |R(/c,fc)| for the matrix R resulting from Method 3, and the 
singular values of A gauss . 
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7.3. A matrix with S-shaped decay in its singular values. We next repeat all experiments con¬ 
ducted in Section 7.1, but now for a square matrix A s given by 

A s = UDV*, 

where U and V are orthonormal (drawn at random from a uniform distribution), and D is a diagonal 
matrix whose entries are the singular values of A s , as shown in Figure 19. The singular values of this 
matrix are chosen to be flat for a while, then decrease rapidly, and then level out again. In other words, 
the “tail” of the singular values is now heavy and exhibits no decay, which is known to be a particularly 
challenging environment for the randomized sampling scheme. The matrix is again of size 300 x 300, and 
we used a block size of b = 50. The results are shown in Figures 19 - 24. 

The results for this example are qualitatively very similar as what we saw in Sections 7.1 and 7.2. This 
examples illustrates particularly well that the randomized schemes are much better at approximating 
matrices in the Frobenius norm than in the spectral norm. Moreover, in this example, we see a strong 
improvement in going from a permutation matrix as the pivoting matrix to Householder pivoting matrices. 
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Figure 19. Approximation errors from Method 1 (blockQR with a permutation matrix) 
for A = A s , cf. Section 7.3. Also included are the errors resulting from a truncated SVD 
(which are theoretically optimal), and a column pivoted QR factorization. 



Figure 20. Comparison of |R(/c, fc)| for the matrix R resulting from Method 1, and the 
singular values of A s . We would ideally like these values to be close. 
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Figure 21. Approximation errors from Method 2 (blockQR with a Householder pivoting 
matrix) for A = A s , cf. Section 7.3. Also included are the errors resulting from a truncated 
SVD (which are theoretically optimal), and a column pivoted QR factorization. 



Figure 22. Comparison of |R(fc, fc)| for the matrix R resulting from Method 2, and the 
singular values of A s . 
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Figure 23. Approximation errors from Method 3 (blockRRQR with p = 25) for A = A s , 
cf. Section 7.3. Also included are the errors resulting from a truncated SVD (which are 
theoretically optimal), and a column pivoted QR factorization. 



Figure 24. Comparison of |R(7’,fc)| for the matrix R resulting from Method 3, and the 
singular values of A s . 
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8 . Conclusions 

We have described techniques for efficiently computing a QR factorization AP = QR of a given matrix 
A. The main innovation is the use of randomized sampling to determine the “pivot matrix” P. The 
randomization allows us to block the factorization which we expect will substantially accelerate execution 
speed, in particular in communication constrained environments such as a matrix processed on a GPU or 
a distributed memory parallel machine, or stored out-of-core. 

We also discussed a variation of the QR factorization where we allow the matrix P to be a product of 
Householder reflectors (as opposed to a permutation matrix in the classical setting). We demonstrated 
through numerical experiments that this generalization leads to dramatic improvements in the approxi¬ 
mation error obtained by truncated factorizations. 
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